A specialized Winograd Conv2d op #971

Draft: wants to merge 15 commits into base: master
Conversation

@bssrdf (Contributor) commented Sep 29, 2024:

This PR adds a new conv2d op using the Winograd algorithm.

Currently ggml's conv2d operator uses im2col and GEMM. There have been efforts to speed up this process using other, faster algorithms. Winograd is one such method, used by many neural network libraries, e.g. cuDNN. For small kernels, e.g. 3x3, Winograd outperforms GEMM-based methods. However, an efficient implementation of Winograd on GPUs requires significant engineering effort. This PR's Winograd implementation is specialized in several ways (a minimal sketch of the underlying transform follows the list below):

  • It only supports 3x3 kernels
  • It only supports input channel counts that are multiples of 8
  • It only supports output filter counts that are multiples of 64
  • It only supports stride = 1, padding = 1, and dilation = 1
  • It only supports the CUDA backend
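
For context, here is a minimal sketch of the 1D Winograd F(2,3) transform that the 2D F(2x2, 3x3) algorithm is built from: it computes 2 outputs of a 3-tap filter from 4 inputs using 4 multiplications instead of 6. The function name and layout are illustrative, not this PR's code.

```c
// F(2,3): y = A^T [ (G g) .* (B^T d) ]
// d: 4 input values, g: 3 filter taps, y: 2 output values
static void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    // filter transform U = G g (in practice precomputed once per filter)
    const float u0 = g[0];
    const float u1 = 0.5f*(g[0] + g[1] + g[2]);
    const float u2 = 0.5f*(g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // input transform V = B^T d
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // elementwise product, then output transform y = A^T m
    const float m0 = u0*v0, m1 = u1*v1, m2 = u2*v2, m3 = u3*v3;
    y[0] = m0 + m1 + m2;  // == g0*d0 + g1*d1 + g2*d2
    y[1] = m1 - m2 - m3;  // == g0*d1 + g1*d2 + g2*d3
}
```

The 2D variant applies the same transforms along both axes of 4x4 input tiles, which is where the 3x3-kernel, stride-1 restriction above comes from.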

Other features:

  • Fully fused, except for the kernel transform, which requires additional workspace (this can be shared with the weights)

It is mainly intended for applications such as stable-diffusion.cpp.

The code is based on the openCNN project, which is licensed under Apache-2.0.

Please review and let me know about any problems; I'll address them. Thanks.

@JohannesGaessler (Collaborator) left a comment:

> The code is based on openCNN project which uses Apache-2.0 license.

Did you get permission from the authors to re-license their code as MIT?

(Three inline review comments, one on src/ggml-cuda.cu and two on src/ggml-cuda/conv-winograd.cu, were marked outdated and resolved.)
typedef float(*pointFunction_t)(float *, int);

template<typename T>
__global__ void FX(const T *pInputs, float *pOutputs, int filt_k,
JohannesGaessler (Collaborator):

Suggested change:
```diff
-__global__ void FX(const T *pInputs, float *pOutputs, int filt_k,
+__global__ void FX(const T * __restrict__ pInputs, float * __restrict__ pOutputs, int filt_k,
```

On Pascal this can be a 5x speedup.


}

__device__ __forceinline__ void prefetch_filter_tile(const float *pInputs, float *tiles, int filt_k){
JohannesGaessler (Collaborator):

The compiler will rearrange these instructions as it sees fit, so in effect there will not be any actual prefetching. For that you need to use asynchronous memcpys (Ampere or newer).
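
For illustration, a minimal sketch (names are mine, not the PR's code) of the kind of asynchronous copy meant here, using cooperative_groups::memcpy_async, which lowers to cp.async on Ampere or newer and actually overlaps the global-to-shared copy with independent work:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// assumes blockDim.x == 256
__global__ void kernel_with_prefetch(const float * __restrict__ src, float * __restrict__ dst) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // issue the copy; on Ampere+ this becomes cp.async and bypasses registers
    cg::memcpy_async(block, tile, src + blockIdx.x*256, sizeof(float)*256);

    // ... independent computation can overlap with the copy here ...

    cg::wait(block); // complete the copy and synchronize the block
    dst[blockIdx.x*256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```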

bssrdf (Contributor, Author) replied:

I am not sure about this one; it is done this way in openCNN.

Comment on lines +11 to +13
__constant__ int access_f_s[2][32];
__constant__ int access_s[2][32];
__constant__ int tileid[2][32];
JohannesGaessler (Collaborator):

What happens in the case of multiple GPUs? Is the constant memory duplicated across GPUs?

bssrdf (Contributor, Author) replied:

I am pretty ignorant about multi-GPU; I guess they will be duplicated, but I don't have a setup to test. Plus, I think this kernel only works on a single GPU.
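
For what it's worth, __constant__ symbols are instantiated once per device, and cudaMemcpyToSymbol only writes the copy on the currently active device. A sketch of what per-device initialization could look like (illustrative, not the PR's code; CUDA_CHECK is ggml-cuda's error-checking macro):

```cuda
static void upload_winograd_tables(const int host_access_f_s[2][32]) {
    int n_devices = 0;
    CUDA_CHECK(cudaGetDeviceCount(&n_devices));
    for (int id = 0; id < n_devices; ++id) {
        CUDA_CHECK(cudaSetDevice(id));
        // this only initializes the symbol on device `id`
        CUDA_CHECK(cudaMemcpyToSymbol(access_f_s, host_access_f_s, sizeof(int)*2*32));
    }
}
```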

Comment on lines +15199 to +15213
static void ggml_compute_forward_winograd_stage0(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {

GGML_ASSERT(false && " CPU backend not implemented!");
return;
}

static void ggml_compute_forward_winograd_stage1(
const struct ggml_compute_params * params,
struct ggml_tensor * dst) {

GGML_ASSERT(false && " CPU backend not implemented!");
return;
}
JohannesGaessler (Collaborator):

If at all possible, a CPU implementation should always be provided, since it serves both as a fallback and as a reference implementation to test other backends against.

bssrdf (Contributor, Author) replied:

A CPU backend should be done, but I am not sure of its benefit compared to the current im2col+GEMM version.
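
For reference, even a naive direct convolution would suffice as the CPU fallback, since its purpose is correctness testing rather than speed. A sketch under this PR's constraints (3x3 kernel, stride = 1, padding = 1, dilation = 1); the layout and names are assumptions, not the PR's code:

```c
// src: [C][H][W], ker: [K][C][3][3], dst: [K][H][W], zero padding of 1
static void conv2d_3x3_ref(const float * src, const float * ker, float * dst,
                           int C, int K, int H, int W) {
    for (int k = 0; k < K; ++k) {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c) {
                    for (int dy = -1; dy <= 1; ++dy) {
                        for (int dx = -1; dx <= 1; ++dx) {
                            const int sy = y + dy, sx = x + dx;
                            if (sy < 0 || sy >= H || sx < 0 || sx >= W) {
                                continue; // zero padding
                            }
                            acc += src[(c*H + sy)*W + sx]
                                 * ker[((k*C + c)*3 + dy+1)*3 + dx+1];
                        }
                    }
                }
                dst[(k*H + y)*W + x] = acc;
            }
        }
    }
}
```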

Comment on lines +7181 to +7185
bool is_node = false;

if (a->grad) {
is_node = true;
}
JohannesGaessler (Collaborator):

If #966 is merged first this will need to be removed (should be very straightforward).

bssrdf (Contributor, Author) replied:

Looking forward to it.

@JohannesGaessler (Collaborator) commented:

Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

@bssrdf marked this pull request as draft on September 29, 2024, 17:40.
@bssrdf (Contributor, Author) commented Sep 29, 2024:

> Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

Thank you for your review, @JohannesGaessler. I learned a lot from your PRs and comments.

First, I have asked openCNN's author about the license issue.

As for performance, I have only tested it in SD.cpp, since it was developed for that use case. To my surprise, it is not faster than im2col+GEMM with tensor cores (my GPU has them, so I assume they are being used), but it is definitely not slower either. It reduces the memory used by the VAE quite a lot while increasing the UNET param buffer. There is room to further improve its performance, as I see several places that are not working optimally.

I'll add test cases in test-backend-ops to measure performance more rigorously.

I addressed your other comments above.
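
As a rough sketch of the comparison such a test would make, using the public ggml API (shapes are illustrative and chosen to satisfy the PR's constraints; this is not the PR's test code):

```cpp
#include "ggml.h"

// builds the existing im2col+GEMM reference path; the timed Winograd variant
// would replace the ggml_conv_2d call with the new op
static struct ggml_tensor * build_conv_graph(struct ggml_context * ctx) {
    // ggml convention: kernel ne = [KW, KH, Cin, Cout], input ne = [W, H, Cin, N]
    struct ggml_tensor * kernel = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 3, 3, 128, 256);
    struct ggml_tensor * input  = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 64, 64, 128, 1);
    return ggml_conv_2d(ctx, kernel, input,
                        /*s0=*/1, /*s1=*/1, /*p0=*/1, /*p1=*/1, /*d0=*/1, /*d1=*/1);
}
```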

@slaren (Collaborator) commented Sep 29, 2024:

Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing. I don't think this can be merged unless the license situation is resolved.

@JohannesGaessler (Collaborator) commented:

Generally speaking, my stance regarding this PR would be as follows: I think it's good to have convolution operations instead of having to rely on IM2COL. At the same time, I want a codebase that is easy to maintain; a central factor for me is that there needs to be some benefit to adding code that offsets the increase in maintenance effort. Quite honestly, I think the starting point from OpenCNN is not very good; I would be rather hesitant to add it, since the use cases are limited and I think none of the devs on this project would have a very good understanding of how the code works.

And as slaren said, the licensing issue must be resolved or this is a total non-starter anyway.

> Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing.

From what I can tell, there is convolution support.

@JohannesGaessler (Collaborator) commented Sep 29, 2024:

I'm already tired so maybe I'm just misreading the docs, but I get the impression that NPP convolutions only support 1-4 input channels.

@bssrdf (Contributor, Author) commented Sep 29, 2024:

Thanks to both of you for reviewing. I am not familiar with licensing issues; in case this one is not resolvable, I'll ditch this PR.
For now I am putting it in draft mode, hoping to make it work in more general settings so that it can truly serve as an alternative to the im2col approach.

@JohannesGaessler (Collaborator) commented:

Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

@bssrdf (Contributor, Author) commented Sep 30, 2024:

> Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

If this PR makes it into main, I intend to maintain it long-term and improve its performance.
